Stemming and Lemmatization with Python and NLTK
November 23, 2017

Stemming and lemmatization are essential for many text mining tasks, such as information retrieval, text summarization, topic extraction, and translation.

Stemming

Stemming strips affixes (in practice mostly suffixes) from a word to reduce it to a base form. However, the resulting stem might not be a word that exists in a dictionary. Let's take a look at how NLTK stems words.

Using PorterStemmer

The Porter stemmer is one of the most commonly used stemmers because it gives good results on English text.

    # Import the Porter stemmer from NLTK
    from nltk.stem import PorterStemmer

    ps = PorterStemmer()
    print(ps.stem("lying"), ps.stem("lies"), ps.stem("lied"))

Output:

    lie lie lie

Using LancasterStemmer

Let's compare these results with LancasterStemmer, which is based on the Lancaster stemming algorithm. It has more than 120 rules for getting stem words.

    # Import the Lancaster stemmer from NLTK
    from nltk.stem import LancasterStemmer

    ls = LancasterStemmer()
    print(ls.stem("lying"), ls.stem("lies"), ls.stem("lied"))

Output:

    lying lie lied

We can see the difference between the outputs of these two algorithms. There is also SnowballStemmer, which supports other languages besides English.

Lemmatization

Lemmatization is quite similar to stemming, as it also converts a word into its base form. However, the root word, also called the lemma, is present in the dictionary. Lemmatization is considerably slower than stemming because an additional step is performed to check whether the lemma is present in the dictionary.

Note: We also have to specify the part of speech (POS) of the word in order to get the correct lemma. Words can be nouns (n), adjectives (a), verbs (v), or adverbs (r). Therefore, we first have to get the POS of a word before we can lemmatize it.

First, let's import the libraries.
    # Requires the WordNet data; run nltk.download("wordnet") once beforehand
    from collections import Counter
    from nltk.corpus import wordnet  # dictionary of words with their parts of speech
    from nltk.stem import WordNetLemmatizer  # lemmatizes a word based on its part of speech

Now we have to get the POS of a word. For this purpose, we can use the WordNet corpus. wordnet.synsets() returns a list of all synsets (senses) of a word, each tagged with a POS. I have written a function that counts the synsets per POS and returns the most frequent one.

    def get_pos(word):
        w_synsets = wordnet.synsets(word)

        # Count how many synsets the word has for each part of speech
        pos_counts = Counter()
        pos_counts["n"] = len([item for item in w_synsets if item.pos() == "n"])
        pos_counts["v"] = len([item for item in w_synsets if item.pos() == "v"])
        pos_counts["a"] = len([item for item in w_synsets if item.pos() == "a"])
        pos_counts["r"] = len([item for item in w_synsets if item.pos() == "r"])

        most_common_pos_list = pos_counts.most_common(3)
        # First index picks the top entry; second index picks the POS from the (POS, count) tuple
        return most_common_pos_list[0][0]

Now let's create the WordNetLemmatizer object and perform the lemmatization. Its lemmatize method takes two arguments: the word to lemmatize and the POS of the word.

    words = ["running", "lying", "cars", "m!spleed"]
    wnl = WordNetLemmatizer()
    for word in words:
        print(wnl.lemmatize(word, get_pos(word)), end=" ")  # print without a newline

Output:

    run lie car m!spleed

Difference between Stemming and Lemmatization

The difference between stems and lemmas is that lemmas are present in the dictionary, whereas stems might not be. The following demonstration reuses the ps stemmer and the get_pos function defined above.
    print("Stemming results:", ps.stem("deactivating"), ps.stem("deactivated"), ps.stem("deactivates"))

    print("Lemmatization results:", end=" ")
    words = ["deactivating", "deactivated", "deactivates"]
    wnl = WordNetLemmatizer()
    for word in words:
        print(wnl.lemmatize(word, get_pos(word)), end=" ")  # print without a newline

Output:

    Stemming results: deactiv deactiv deactiv
    Lemmatization results: deactivate deactivate deactivate

Alright, that concludes our demonstration of stemming and lemmatization using NLTK in Python.
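As a supplement, the contrast between a stem and a lemma can be mimicked in a few lines of plain Python, without NLTK. This is only a toy sketch: the suffix list, the lookup table, and the names toy_stem and toy_lemmatize are made up for illustration. The real Porter and Lancaster stemmers apply many more rules, and WordNetLemmatizer consults WordNet rather than a hand-written dict.

```python
# Toy illustration only: the suffix list and lemma table below are
# hypothetical and far smaller than what Porter or WordNet actually use.

SUFFIXES = ("ating", "ated", "ates", "ing", "ed", "es", "s")  # longest first

# Hand-written lookup table standing in for a real dictionary such as WordNet.
LEMMA_TABLE = {
    "deactivating": "deactivate",
    "deactivated": "deactivate",
    "deactivates": "deactivate",
    "lying": "lie",
    "cars": "car",
}


def toy_stem(word):
    """Strip the first matching suffix; the result need not be a real word."""
    for suffix in SUFFIXES:
        # Keep at least three characters so very short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Return the dictionary form if the word is known, else the word itself."""
    return LEMMA_TABLE.get(word, word)


if __name__ == "__main__":
    for w in ["deactivating", "deactivated", "deactivates", "cars", "m!spleed"]:
        print(w, "->", toy_stem(w), "|", toy_lemmatize(w))
```

The rule-based function chops "deactivating" down to "deactiv" and even trims the misspelled "m!spleed", because it never checks whether the result is a word. The lookup-based function only returns forms it knows and leaves unknown input untouched, which is the same behaviour seen with WordNetLemmatizer above.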